Support Vector Machines

Author

Termeh Shafie

1 Support Vector Machines

Support vector machines (SVMs) offer a direct approach to binary classification: try to find a hyperplane in some feature space that “best” separates the two classes. In practice, however, it is difficult (if not impossible) to find a hyperplane to perfectly separate the classes using just the original features. SVMs overcome this by extending the idea of finding a separating hyperplane in two ways:

  1. loosen what we mean by “perfectly separates”,
  2. use the so-called kernel trick to enlarge the feature space to the point that perfect separation of classes is (more) likely.

SVMs are quite versatile and have been applied to a wide variety of domains ranging from chemistry to pattern recognition. They are best used in binary classification scenarios. This brings up a question as to where SVMs are to be preferred to other binary classification techniques such as logistic regression. The honest response is, “it depends”, but here are some points to keep in mind when choosing between the two. A general point to keep in mind is that SVM algorithms tend to be expensive both in terms of memory and computation, issues that can start to hurt as the size of the dataset increases.

We use the e1071 library in R to demonstrate the support vector classifier and the SVM. Another option is the LiblineaR library, which is useful for very large linear problems.

1.1 Warm-Up: SVM on iris dataset

1.1.1 Training and test datasets

#load required library
library(e1071)

#load built-in iris dataset
data(iris)

#set seed to ensure reproducible results
set.seed(42)

#split into training and test sets
iris[, "train"] <- ifelse(runif(nrow(iris)) < 0.8, 1, 0)

#separate training and test sets
trainset <- iris[iris$train == 1,]
testset <- iris[iris$train == 0,]

#get column index of train flag
trainColNum <- grep("train", names(trainset))

#remove train flag column from train and test sets
trainset <- trainset[,-trainColNum]
testset <- testset[,-trainColNum]

dim(trainset)
[1] 115   5
dim(testset)
[1] 35  5

1.1.2 Build the SVM model

#get column index of predicted variable in dataset
typeColNum <- grep("Species", names(iris))

#build model – linear kernel and C-classification (soft margin) with default cost (C=1)
svm_model <- svm(Species~ ., data = trainset, 
                 method = "C-classification", 
                 kernel = "linear")
svm_model

Call:
svm(formula = Species ~ ., data = trainset, method = "C-classification", 
    kernel = "linear")


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  1 

Number of Support Vectors:  24

1.1.3 Support Vectors

The output from the SVM model show that there are 24 support vectors. If desired, these can be examined using the SV variable in the model, i.e via svm_model$SV.

# support vectors
svm_model$SV
    Sepal.Length Sepal.Width Petal.Length Petal.Width
19   -0.25639203  1.76683447   -1.3228618  -1.3054201
42   -1.70055936 -1.70445193   -1.5591789  -1.3054201
45   -0.97847569  1.76683447   -1.2047033  -1.1709010
53    1.18777530  0.14690082    0.5676747   0.3088091
55    0.70638619 -0.54735646    0.3904369   0.3088091
57    0.46569164  0.60973900    0.4495161   0.4433282
58   -1.21917025 -1.47303284   -0.3775936  -0.3637864
69    0.34534436 -1.93587102    0.3313576   0.3088091
71   -0.01569747  0.37831991    0.5085954   0.7123664
73    0.46569164 -1.24161374    0.5676747   0.3088091
78    0.94708075 -0.08451828    0.6267539   0.5778473
84    0.10464981 -0.77877556    0.6858332   0.4433282
85   -0.61743386 -0.08451828    0.3313576   0.3088091
86    0.10464981  0.84115809    0.3313576   0.4433282
99   -0.97847569 -1.24161374   -0.5548314  -0.2292673
107  -1.21917025 -1.24161374    0.3313576   0.5778473
111   0.70638619  0.37831991    0.6858332   0.9814046
117   0.70638619 -0.08451828    0.9221503   0.7123664
124   0.46569164 -0.77877556    0.5676747   0.7123664
130   1.54881714 -0.08451828    1.0993881   0.4433282
138   0.58603892  0.14690082    0.9221503   0.7123664
139   0.10464981 -0.08451828    0.5085954   0.7123664
147   0.46569164 -1.24161374    0.6267539   0.8468855
150  -0.01569747 -0.08451828    0.6858332   0.7123664

1.1.4 Predictions on training and test model

# training set predictions
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$Species)
[1] 0.9826087
# test set predictions
pred_test <-predict(svm_model, testset)
mean(pred_test == testset$Species)
[1] 0.9142857

The test prediction accuracy indicates that the linear performs quite well on this dataset, confirming that it is indeed near linearly separable. To check performance by class, one can create a confusion matri. Another point is that we have used a soft-margin classification scheme with a cost \(C=1\). You can experiment with this by explicitly changing the value of \(C\). Again, I’ll leave this for you an exercise.

# confusion matrix
cm <- table(pred_test, testset$Species)
cm
            
pred_test    setosa versicolor virginica
  setosa         18          0         0
  versicolor      0          5         3
  virginica       0          0         9
# accuracy
sum(diag(cm)) / sum(cm)
[1] 0.9142857

1.2 SVM on Penguins Data

library(datasets) # penguins data come from here
library(tidyverse) 
# plot data
ggplot(penguins, aes(x = bill_len, y = bill_dep, color = species)) +
  geom_point() +
  theme_minimal() +
  labs(
    x = "Bill Length (in mm)",
    y = "Bill Depth (in mm)",
    title = "Bill Length vs. Bill Depth by Species"
  )
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

1.2.1 Linear, Separable Case

The penguins data set contains morphological measurements for individual penguins collected in the Palmer Archipelago, Antarctica. It includes observations from three species: Adélie, Chinstrap, and Gentoo—and records key physical characteristics such as bill length, bill depth, flipper length, body mass, sex, and island of origin. The data are commonly used for demonstrating exploratory data analysis, visualization, and classification methods, as the species show both overlap and separation in their physical traits.

Here, we merge the Adélie and Chinstrap groups and restrict the analysis to two dimensions, using bill depth and flipper width as the discriminating variables. A helper function is used to plot an approximate version of the margin. note that we use ggplot2 here it needs to be loaded.

# Helper to reproduce the plot (points, SV squares, margin lines)
sepLinePlot <- function(svm_fit, data,
                        xvar = "flipper_len",
                        yvar = "bill_dep") {

  # response + class colors
  yname <- all.vars(formula(svm_fit))[1]
  cls <- factor(data[[yname]])
  cols <- c("red", "green")[as.integer(cls)]

  plot(data[[xvar]], data[[yvar]],
       col = cols, pch = 1,
       xlab = xvar, ylab = yvar)

  # compute w and b from e1071 internals
  w <- as.numeric(t(svm_fit$coefs) %*% svm_fit$SV)
  names(w) <- colnames(svm_fit$SV)
  b <- -svm_fit$rho

  # pull the correct coefficients for the plotted axes
  wx <- w[[xvar]]
  wy <- w[[yvar]]

  xseq <- seq(min(data[[xvar]], na.rm = TRUE),
              max(data[[xvar]], na.rm = TRUE),
              length.out = 200)

  # if wy ~ 0 => vertical boundary in this coordinate system
  if (abs(wy) < 1e-12) {
    x0 <- -b / wx
    abline(v = x0, col = "cyan", lwd = 2)
    abline(v = (-b + 1) / wx, col = "cyan", lty = 2)
    abline(v = (-b - 1) / wx, col = "cyan", lty = 2)
  } else {
    y0  <- -(wx * xseq + b) / wy
    yP1 <- -(wx * xseq + b - 1) / wy
    yM1 <- -(wx * xseq + b + 1) / wy

    lines(xseq, y0,  col = "cyan", lwd = 2)
    lines(xseq, yP1, col = "cyan", lty = 2)
    lines(xseq, yM1, col = "cyan", lty = 2)
  }

  # support vectors as grey squares
  points(svm_fit$SV[, xvar],
         svm_fit$SV[, yvar],
         pch = 0, cex = 1.6, lwd = 1.2, col = "grey60")
}
library(ggplot2)

#merge Adelie & Chinstrap species to make the problem easier
penguins_12_3 <- penguins %>%
  mutate(species=dplyr::recode(species,`Adelie`="Adel/Chin",`Chinstrap`="Adel/Chin"))


# Fit model, no scaling (important for plotting lines in original units)
sv.fit.lin0 <- e1071::svm(
  species ~ bill_dep + flipper_len,
  data   = penguins_12_3,
  kernel = "linear",
  scale  = FALSE
)

# Plot
sepLinePlot(sv.fit.lin0, penguins_12_3)

The solid line represents the separating (hyper)plane, while the margins are shown as dashed lines. The margin is constructed by widening the separating line until it just touches points on either side of the boundary, known as support vectors, with the margin edges remaining parallel to the separating plane.

1.2.2 Non-Separable Case

As an example, consider the penguins data while ignoring the Gentoo species, which results in a non-separable case with imperfectly separated groups. The analysis is restricted to two dimensions, using bill length and bill depth as the discriminating variables. In this setting there is no clean separating line between the groups; instead, the algorithm accommodates this by allowing slack, permitting a limited amount of crossover within the margin, as will be discussed in more detail later.

# keep only Adelie and Chinstrap
penguins_1_2 <- penguins %>% 
  dplyr::filter(species != "Gentoo")

# fit linear SVM 
sv.fit.lin <- e1071::svm(
  species ~ bill_len + bill_dep,
  data   = penguins_1_2,
  kernel = "linear",
  scale  = FALSE
)

# plot using the helper (bill_len on x, bill_dep on y)
sepLinePlot(
  sv.fit.lin,
  penguins_1_2,
  xvar = "bill_dep",
  yvar = "bill_len"
)

This plot shows the result of fitting a linear soft-margin support vector machine to the Adélie and Chinstrap penguins using bill depth (horizontal axis) and bill length (vertical axis).

Each point represents an individual penguin, with colour indicating species. The solid cyan line is the estimated separating hyperplane, and the dashed cyan lines indicate the margins on either side. Because the two species overlap substantially in this two-dimensional feature space, there is no clean separating line. As a result, the fitted margin passes through dense regions of both classes.

The square-boxed points are the support vectors. In this non-separable setting, these include not only points lying exactly on the margin, but also points inside the margin and some that are on the wrong side of the separating line. These observations are the ones that determine the position and orientation of the hyperplane.

Many observations fall within the margin and several lie on the “wrong” side of the separating line. This illustrates the role of slack variables in a soft-margin SVM: the algorithm allows violations of the margin, and even misclassifications, in order to achieve the best overall trade-off between margin width and classification error. Given the restriction to two variables and a linear decision boundary, this plot represents about as good a separation as can be achieved for these two species.

Each element of the slack vector
\[ \boldsymbol{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_n)^\top \]

is known as the slack for a given observation. The slack variables quantify how far each point lies from the margin for its assigned group and can be viewed as allowances for observations that fall on the “wrong side” of the separating plane, playing a role analogous to residuals in regression.

When \(\varepsilon_i = 0\), observation \(i\) lies on the correct side of the margin. If \(0 < \varepsilon_i \leq 1\), observation \(i\) is still on the correct side of the separating plane but lies within the margin; such points are classified correctly, although in the case of perfectly separable groups no observations fall within the margin. When \(\varepsilon_i > 1\), observation \(i\) lies on the wrong side of the separating plane and is therefore misclassified.

The margin width \(M\) and the cost parameter \(C\) work together to control the use of slack. In this formulation, decreasing the cost \(C\) allows more slack, which in turn results in a larger number of support vectors and greater tolerance for margin violations.

1.2.3 Using e1071 Package

We have tried to align the above formulation with the implementation of the support vector machine in the svm function from the e1071 package. In particular, we use the cost parameter \(C\) to modify the separating plane and illustrate the consequences of this choice. Shown next is the final SVM classification for the second example using the default cost \(C = 1\), which results in several misclassified observations.

plot(sv.fit.lin,penguins_1_2,formula=bill_dep ~ bill_len)

In this plot, group membership is indicated by the colour of the symbols, from which it can be seen that there are seven misclassified observations. Points are shown with an open circle when they are not support vectors and with a cross when they are support vectors. The plot reveals a relatively large number of support vectors, in contrast to the seven support vectors observed in the completely separable case. We now decrease the cost parameter \(C\), forcing the margin \(M\) to shrink in order to accommodate the slack associated with the misclassified observations, which in turn increases the number of support vectors used. The advantage of decreasing \(C\), and hence increasing the number of support vectors, is that the resulting separating plane becomes less sensitive to a small number of outlying observations. The results for \(C = 0.01\) are shown next.

sv.fit.lin.lowCost <- svm(species ~ bill_len+bill_dep, data = penguins_1_2,kernel='linear',cost=0.01)
plot(sv.fit.lin.lowCost,penguins_1_2,formula=bill_dep ~ bill_len)

summary(sv.fit.lin.lowCost)

Call:
svm(formula = species ~ bill_len + bill_dep, data = penguins_1_2, 
    kernel = "linear", cost = 0.01)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  0.01 

Number of Support Vectors:  129

 ( 65 64 )


Number of Classes:  2 

Levels: 
 Adelie Chinstrap Gentoo

The summary of the fitted model shows that 129 support vectors were used, and the separating line can be seen to have shifted in an interesting way. There are now more misclassified observations, almost all of which belong to the Chinstrap species. This suggests that some of the original support vectors may have been overly influential; however, the adjustment has also introduced additional classification error.

1.3 The Kernel Trick

The kernel trick is a concept that underlies a wide range of machine learning techniques and is based on two key ideas. First, groups that are not linearly separable in \(p\) dimensions may become separable in a higher-dimensional space of dimension \(q > p\) through an appropriate transformation. Second, rather than explicitly computing this transformation, training the model in the higher-dimensional space, and then making predictions on new data, we can instead use a kernel function with a suitable inner product, which leads to an equivalent optimization problem and solution without ever working directly in the higher-dimensional space.

A rather extreme example using two circles helps illustrate the idea. Let one group be defined by the relationship
\(X_1^2 + X_2^2 = 2^2\),
and the other by
\(X_1^2 + X_2^2 = 1^2\).

If we transform the two-dimensional relationships into three dimensions using the mapping
\((X_1, X_2) \mapsto (X_1, X_2, X_1^2 + X_2^2)\),
the two groups become clearly separable.

Note that we need an extra package for 3D plots. We also plot the 2D original set of relationships for comparison.

library(plot3D) # install if needed

set.seed(1001)
x1 <- runif(100, -2, 2); y1 <- sqrt(4 - x1^2)
x2 <- runif(100, -1, 1); y2 <- sqrt(1 - x2^2)
df <- data.frame(
  x1 = c(x1, x1, x2, x2),
  x2 = c(y1, -y1, y2, -y2)
)
df$z <- df$x1^2 + df$x2^2
plot3D::scatter3D(df$x1, df$x2, df$z)

# 2d correspondence
plot(x2~x1,data=df,col=c(4,2,2,3)[round(z)]) 

Before going into details, we run svm on the circle data using several different kernels. First the linear:

sv.circ.lin <- svm(factor(z)~ x1+x2,data = df, kernel='linear')
plot(sv.circ.lin,df,formula=x2~x1,main="Linear Kernel")

Using the Radial Kernel we get:

sv.circ.rad <- svm(factor(z)~ x1+x2,data = df, kernel='radial')
plot(sv.circ.rad,df,formula=x2~x1)

1.3.1 Back to the Penguin Data

Here a support vector machine with a radial kernel is fitted to the penguin data using bill length and bill depth as predictors. The radial kernel implicitly maps the data into a higher-dimensional feature space, allowing the classifier to learn a non-linear decision boundary between the species. The subsequent plot displays this non-linear separating boundary in the original two-dimensional feature space, illustrating how the radial kernel can capture complex class structure that a linear SVM cannot.

sv.fit.rad <- svm(species ~ bill_len+bill_dep, data = penguins_1_2, kernel='radial')
plot(sv.fit.rad,penguins_1_2,formula=bill_dep ~ bill_len)

We can adjust the model parameters to improve classification performance, but care must be taken to avoid overfitting; in practice, methods such as cross-validation should be used to select these parameters.

Below, the parameter \(\gamma\) is increased to 2, which accentuates large differences between points.

sv.fit.rad.gam2 <- svm(species ~ bill_len+bill_dep, data = penguins_1_2,kernel='radial', gamma=2)
plot(sv.fit.rad.gam2, penguins_1_2, formula=bill_dep ~ bill_len)

1.3.2 SVM with more than two groups

The basic idea is to apply the support vector machine recursively, comparing one group against all remaining groups, then comparing the second group against groups three and higher, and so on. As a final example, all species in the penguins data are used, with both linear and radial kernel fits under default settings.

sv.fit.lin_all3 <- svm(species ~ bill_len + bill_dep, data = penguins,kernel='linear')
plot(sv.fit.lin_all3, penguins, formula=bill_dep ~ bill_len)

As before, Gentoo is easy to separate from the other species, but this multiclass extension does not substantially improve predictive performance. Distinguishing Adélie from Chinstrap remains difficult near the decision boundary, and the radial kernel provides only a modest improvement.

The radial variant is as follows:

sv.fit.rad_all3 <- svm(species ~ bill_len + bill_dep, data = penguins, kernel='radial')
plot(sv.fit.rad_all3,penguins,formula=bill_dep ~ bill_len)

1.4 Cross Validation

As done previously in this course, the basic idea is to withhold a portion of the training data and use it as a test set. First, one loops over the open tuning parameters, such as \(C\) and \(\gamma\), fitting the model on the training set while holding these parameters fixed. The fitted model is then evaluated on the test set, and this process is repeated for all parameter combinations. The parameter values that yield the best performance are then selected. We applied this approach to the penguins data, still attempting classification using only two features.

A grid search is conducted over \(C = 0.01, 0.1, 1, 10, 100\) and \(\gamma = 0.5, 1, 2, 3, 4\). Somewhat surprisingly, the default parameter values performed best in this case.

set.seed(1001)
tune.out <- tune(svm, species ~., data=penguins, kernel='radial', ranges = list(cost=c(0.01,0.1,1,10,100),gamma=c(0.5,1,2,3,4)))
summary(tune.out)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost gamma
    1   0.5

- best performance: 0.01212678 

- Detailed performance results:
    cost gamma      error dispersion
1  1e-02   0.5 0.56054495 0.07977601
2  1e-01   0.5 0.03004679 0.02427130
3  1e+00   0.5 0.01212678 0.02119780
4  1e+01   0.5 0.02407531 0.02361808
5  1e+02   0.5 0.02407531 0.02361808
6  1e-02   1.0 0.56054495 0.07977601
7  1e-01   1.0 0.21904650 0.06312732
8  1e+00   1.0 0.01800914 0.02523584
9  1e+01   1.0 0.02113414 0.02468684
10 1e+02   1.0 0.02113414 0.02468684
11 1e-02   2.0 0.56054495 0.07977601
12 1e-01   2.0 0.53040393 0.08062374
13 1e+00   2.0 0.04503629 0.02956429
14 1e+01   2.0 0.04797746 0.03525477
15 1e+02   2.0 0.04797746 0.03525477
16 1e-02   3.0 0.56054495 0.07977601
17 1e-01   3.0 0.56054495 0.07977601
18 1e+00   3.0 0.09274589 0.07027869
19 1e+01   3.0 0.08980472 0.06407604
20 1e+02   3.0 0.08980472 0.06407604
21 1e-02   4.0 0.56054495 0.07977601
22 1e-01   4.0 0.56054495 0.07977601
23 1e+00   4.0 0.13162131 0.07053983
24 1e+01   4.0 0.12270865 0.06670692
25 1e+02   4.0 0.12270865 0.06670692
Note

Dispersion, as shown above, is analogous to a root mean squared error (RMSE), \[ \mathrm{rmse} = \sqrt{\frac{1}{n}\sum_i \left(\hat{y}_i - y_i\right)^2}, \] and is a quantity we seek to minimize. However, it is possible to choose parameter values that yield improved within-sample performance but do not translate into improved out-of-sample performance.

Our final model is thus:

sv.fit.rad_all3 <- svm(species ~ bill_len+bill_dep, data= penguins, kernel='radial', cost=1, gamma=.5)
plot(sv.fit.rad_all3, penguins, formula = bill_dep ~ bill_len)

1.5 Exercise: Radial Kernels and Nonlinear Decision Boundaries

This section explores the use of radial kernels to allow decision boundaries to be both non-linear and non-contiguous. As a reminder, the linear and radial (sometimes called Gaussian) kernels are defined as follows.

Linear kernel:

\[ K(\mathbf{X}_i, \mathbf{X}_j) = \langle \mathbf{X}_i, \mathbf{X}_j \rangle, \]

where \(\langle \cdot \rangle\) denotes the standard vector inner product, as defined in the notes.

Radial kernel:

\[ K(\mathbf{X}_i, \mathbf{X}_j) = \exp\!\left(-\gamma \lVert \mathbf{X}_i - \mathbf{X}_j \rVert^2\right). \]

This kernel is based on a distance measure and includes a single tuning parameter \(\gamma\), which controls the influence of large distances between observations. Larger values of \(\gamma\) emphasize local structure, while smaller values lead to smoother decision boundaries. We briefly explore the effect of this parameter here, with a more detailed discussion deferred to the next class.

For the following exercises, we use the Australian Crabs data (crabs), a benchmark data set that is notoriously difficult to cluster and relatively challenging to classify. The five primary predictors (measurements) are abbreviated as FL, RW, CL, CW, and BD. The data needs to be preprocessed so that sex and species are combined into a single categorical variable, group, with four levels:

  1. Blue Males
  2. Orange Males
  3. Blue Females
  4. Orange Females

In most analyses, we do not explicitly split the data into training and test sets. Instead, we rely on various forms of cross-validation to properly assess out-of-sample performance.

You are asked to compare and contrast SVM fits using the raw measures.

  • Run svm with the following formula using all five measures: group ~ FL + RW + CL + CW + BD, first with the linear kernel and then with the radial kernel.
  • Compare the accuracy (and optionally the confusion matrices) for these two runs.
  • If the radial-kernel fit performs worse, adjust the tuning parameter by setting gamma to 2, 4, or even 10, and then compare the accuracy of this updated fit.
  • If you would like to visualize the fits, try plotting RW vs. CW, but you must set slice to reasonable values for the remaining features.

Determine if the following statements are true/false based on your output:

  1. The crabs data, examined using raw features seems to be divisible into groups using linear ‘cuts’
  2. In order to perform well, use of the radial kernel requires higher penalties for large distances between observations
  3. Visualization of the svm solution is hindered by our choice for the slice?